Conversation

@rahil-c (Collaborator) commented Apr 12, 2023

Change Logs

Describe context and summary for this change. Highlight if any code was copied.

Impact

Describe any public API or user-facing feature change or any performance impact.

Risk level (write none, low, medium or high below)

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change.

  • The config description must be updated if new configs are added or the default value of the configs is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@yihua self-assigned this Apr 12, 2023
@yihua (Contributor) left a comment

LGTM since most of the changes update API usage only. Is there any compatibility issue with different Spark, Flink and Hadoop versions?

  <groupId>com.amazonaws</groupId>
  <artifactId>dynamodb-lock-client</artifactId>
- <version>${dynamodb.lockclient.version}</version>
+ <version>1.2.0</version>
Contributor

nit: could you keep the version variable?

Collaborator Author

As in we do not specify a version and let it pull in the latest version?

- return AmazonCloudWatchAsyncClientBuilder.standard()
-     .withCredentials(HoodieAWSCredentialsProviderFactory.getAwsCredentialsProvider(props))
+ private static CloudWatchAsyncClient getAmazonCloudWatchClient(Properties props) {
+   return CloudWatchAsyncClient.builder()
Contributor

What did .standard() do before? Is that behavior the default in the builder now?

Collaborator Author

https://github.com/aws/aws-sdk-java-v2/blob/master/docs/LaunchChangelog.md#4-service-changes

If you look at the table for several of these AWS clients, it shows a "before" column where .standard() was used in 1.x; in 2.x it is replaced by the .builder() pattern. See this snippet taken from the link above:

> Client builders no longer contain static methods. The static methods on the clients must be used: AmazonDynamoDBClientBuilder.defaultClient is now DynamoDbClient.create and AmazonDynamoDBClientBuilder.standard is now DynamoDbClient.builder.
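
For illustration, a minimal sketch of that builder change using the CloudWatch client touched in this PR (only the SDK types and methods are real; the wrapper class is hypothetical):

  import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider;
  import software.amazon.awssdk.services.cloudwatch.CloudWatchAsyncClient;

  public class CloudWatchClientSketch {
    // v1: AmazonCloudWatchAsyncClientBuilder.standard() returned a builder
    // pre-configured with the default region and credential chain:
    //   AmazonCloudWatchAsyncClientBuilder.standard()
    //       .withCredentials(credentialsProvider)
    //       .build();
    //
    // v2: there is no standard(); builder() applies the same defaults
    // unless they are explicitly overridden.
    static CloudWatchAsyncClient buildClient() {
      return CloudWatchAsyncClient.builder()
          .credentialsProvider(DefaultCredentialsProvider.create())
          .build();
    }
  }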

StorageDescriptor partitionSD = sd.copy(copySd -> copySd.columns(newColumns));
final Instant now = Instant.now();
TableInput updatedTableInput = TableInput.builder()
.tableType(table.tableType())
Contributor

Is the table name missing?

Collaborator Author

Thanks for catching this. I will fix it and check this class once more, just to be safe against any other misses like this.
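
For context, a minimal sketch of what the fix could look like (the wrapper class and method are hypothetical; Table and TableInput are real Glue SDK v2 model types):

  import software.amazon.awssdk.services.glue.model.Table;
  import software.amazon.awssdk.services.glue.model.TableInput;

  class TableInputSketch {
    // TableInput.builder() starts from an empty builder, so every field,
    // including the name, must be carried over from the existing Table.
    static TableInput withNameCarriedOver(Table table) {
      return TableInput.builder()
          .name(table.name())           // the field flagged as missing above
          .tableType(table.tableType())
          .build();
    }
  }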

- .withDatabaseName(databaseName)
- .withTableInput(updatedTableInput);
+ StorageDescriptor sd = table.storageDescriptor();
+ StorageDescriptor partitionSD = sd.copy(copySd -> copySd.columns(newColumns));
Contributor

Here it makes a copy instead of an in-place change. Any reason for doing this?

Collaborator Author

In SDK v2 the POJOs are immutable, so I can't modify the original StorageDescriptor sd object; the only option is to make a clone and pass that.

From the docs, https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/migration-whats-different.html:

> Immutable POJOs
> Clients and operation request and response objects are now immutable and cannot be changed after creation. To reuse a request or response variable, you must build a new object to assign to it.
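
A minimal sketch of the copy-on-write pattern this enforces (the wrapper class is hypothetical; copy(...) and toBuilder() are real methods on v2 model objects such as StorageDescriptor):

  import java.util.List;
  import software.amazon.awssdk.services.glue.model.Column;
  import software.amazon.awssdk.services.glue.model.StorageDescriptor;

  class ImmutablePojoSketch {
    // v2 model objects cannot be mutated in place; copy(...) clones the
    // object through its builder, applies the changes, and returns a new
    // instance. Equivalent long form: sd.toBuilder().columns(newColumns).build()
    static StorageDescriptor replaceColumns(StorageDescriptor sd, List<Column> newColumns) {
      return sd.copy(builder -> builder.columns(newColumns));
    }
  }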

* // ... start making calls to table ...
* </pre>
*/
public class DynamoTableUtils {
Contributor

Why do we have to copy the code from the library?

Collaborator Author

Yes, unfortunately this is an issue.

Basically, in Hudi we were using the TableUtils class, which was a src class in v1 that we invoked in the DynamoDB locking feature, but in v2 this class has been moved to a test class; see this commit: aws/aws-sdk-java-v2@973237c.

Because of this we can no longer access the class due to its test scope. I didn't see any replacement DynamoDB utility-like classes in SDK v2.

So currently I opted to copy the class over and left a comment pointing to where this code can be found, so we don't have to reinvent it here. Let me know if you have any other ideas on how we can go about this.

Comment on lines +139 to +150

<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>${aws.sdk.httpclient.version}</version>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpcore</artifactId>
<version>${aws.sdk.httpcore.version}</version>
</dependency>

Contributor

The hudi-aws module already adds these, so there is no need to add both dependencies again?

Collaborator Author

Makes sense, will remove these and run the dependency tree to confirm.

Comment on lines +471 to +478
<!-- AWS Services -->
<!-- https://mvnrepository.com/artifact/software.amazon.awssdk/aws-java-sdk-sqs -->
<dependency>
<groupId>software.amazon.awssdk</groupId>
<artifactId>sqs</artifactId>
<version>${aws.sdk.version}</version>
</dependency>

Contributor

Should this be in the hudi-aws module? Why is it required now?

Collaborator Author

It seems that the event source classes inside the hudi-utilities module will require this dependency:
https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsSource.java

But I see what you mean: we don't have to place this dependency inside the utilities pom; instead we can place it in the hudi-aws pom: https://github.com/apache/hudi/blob/master/hudi-aws/pom.xml#L144

Comment on lines +156 to +159
<include>org.apache.hudi:hudi-aws</include>
<include>software.amazon.awssdk:*</include>
<include>org.apache.httpcomponents:httpclient</include>
<include>org.apache.httpcomponents:httpcore</include>
Contributor

We should avoid this. If the AWS ecosystem is used, hudi-aws-bundle should be used with hudi-utilities-bundle: https://hudi.apache.org/releases/release-0.12.2#bundle-updates.

Collaborator Author

So we basically remove all of this from the utilities bundle, and then it's mandatory that the user pass both the utilities bundle and the aws bundle?

Collaborator Author

So we can remove this change here, and the user will pass the bundle then.

@yihua added the dependencies (Dependency updates) and aws-support labels Apr 12, 2023
@hudi-bot (Collaborator)

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@rahil-c (Collaborator Author) commented Apr 19, 2023

@yihua

> LGTM since most of the changes update API usage only. Is there any compatibility issue with different Spark, Flink and Hadoop versions?

So far I have only tested with Spark 3 and Hadoop 3, and I manually tested the AWS-related Hudi features like OCC, GlueSyncTool, and CloudWatchMetricsReporter on EMR.

I can try running some tests with Flink, but I am curious which version combinations (Hadoop, Spark, Flink) I should test before we land this.

@yihua added the priority:blocker (Production down; release blocker) and release-0.14.0 labels Jul 6, 2023
@yihua (Contributor) commented Aug 3, 2023

> So far I have only tested with Spark 3 and Hadoop 3, and I manually tested the AWS-related Hudi features like OCC, GlueSyncTool, and CloudWatchMetricsReporter on EMR.
>
> I can try running some tests with Flink, but I am curious which version combinations (Hadoop, Spark, Flink) I should test before we land this.

We synced offline some time ago: we should test Spark 2.4 and Spark 3.x with Hadoop 2 on S3 to make sure there is no compatibility issue, especially around the S3A file system.

@yihua (Contributor) commented Aug 3, 2023

Closing this in favor of #9347.

@yihua closed this Aug 3, 2023

Labels

big-needle-movers, dependencies (Dependency updates), priority:blocker (Production down; release blocker), release-0.14.0

Projects

Status: ✅ Done


3 participants